Analysis of Language Variation Using a Large-Scale Corpus of Spontaneous Speech
Authors
Abstract
A large-scale corpus of spontaneous speech can be a powerful tool for the study of language variation. Moreover, when the corpus is publicly available, corpus-based analysis opens up the possibility of follow-up analysis in this area of linguistic study. Generally speaking, follow-up study is highly desirable in the sciences, but so far it has been virtually impossible in sociolinguistics due to the lack of shared corpora. In this paper, I will present some results of the analyses of the Corpus of Spontaneous Japanese (CSJ) that we developed in the years 1999-2003. CSJ is a large, richly annotated corpus of spontaneous speech of present-day Japanese (http://www2.kokken.go.jp/~csj/public/index.html), containing more than 660 hours of speech uttered by more than 1400 speakers. This corpus was designed primarily for statistical machine learning of acoustic and language models for automatic spontaneous speech recognition, but it was also designed for the study of language variation. So far, we have analyzed variation at different levels of language structure, including vowel devoicing, pitch-accent location in adjectives, coalescence of particle succession, moraic nasalization of particles, diffusion of the new potential verb forms, choice of phrase-final boundary pitch movements (BPM), and strength of the prosodic boundary preceding accented particles. In addition to these, an analysis of word-form variation was conducted. This last analysis was concerned not only with individual lexical items, but also with the lexicon as a whole.
Similar resources
Why Is the Recognition of Spontaneous Speech so Hard?
Although speech derived from reading texts, and similar types of speech, e.g. that from reading newspapers or that from news broadcast, can be recognized with high accuracy, recognition accuracy drastically decreases for spontaneous speech. This is due to the fact that spontaneous speech and read speech are significantly different acoustically as well as linguistically. This paper reports anal...
Benchmark Test for Speech Recognition Using the Corpus of Spontaneous Japanese
We present benchmark results of automatic speech recognition using the Corpus of Spontaneous Japanese (CSJ), which has been developed in a five-year national project and will be the largest spontaneous speech database. New test-sets are designed for both academic presentation speech and extemporaneous public speech, which are the two major categories in the corpus. The test-sets are selected ...
Training a Language Model Using Webdata for Large Vocabulary Japanese Spontaneous Speech Recognition
This paper describes a language modeling method using large-scale spoken language data retrieved from the Web for spontaneous speech recognition. We downloaded 15 million Web pages on a comprehensive range of topics. Next, spoken language-like texts were selected from the downloaded Web data using the naïve Bayes classifier, and typical linguistic phenomena such as fillers and pauses were added usin...
Why is Automatic Recognition of Spontaneous Speech So Difficult?
Although speech derived from reading texts, and similar types of speech, e.g. that from reading newspapers or that from news broadcast, can be recognized with high accuracy, recognition accuracy drastically decreases for spontaneous speech. This is due to the fact that spontaneous speech and read speech are significantly different acoustically as well as linguistically. This paper reports analy...
Use of a Large-scale Spontaneous Speech Corpus in the Study of Linguistic Variation
The Corpus of Spontaneous Japanese, or CSJ, is a large-scale database of spontaneous Japanese. It contains speech signals and transcriptions of about 7 million words, along with various annotations such as POS and phonetic labels. After describing its design issues, the potential of the CSJ as a resource for the study of linguistic variation is evaluated.